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ABSTRACT 

Motivation: Assembling genomes from short read data has become 
increasingly popular, but the problem remains computationally 
challenging especially for larger genomes. We study the scaffolding 
phase of sequence assembly where preassembled contigs are 
ordered based on mate pair data. 

Results: We present MIP Scaffolder that divides the scaffolding 
problem into smaller subproblems and solves these with mixed 
integer programming. The scaffolding problem can be represented 
as a graph and the biconnected components of this graph can be 
solved independently. We present a technique for restricting the size 
of these subproblems so that they can be solved accurately with 
mixed integer programming. We compare MIP Scaffolder to two state 
of the art methods, SOPRA and SSPACE. MIP Scaffolder is fast and 
produces better or as good scaffolds as its competitors on large 
genomes. 

Availability: The source code of MIP Scaffolder is freely available at 

http://www.cs.helsinki.fi/u/lmsalmel/mip-scaffolder/. 

Contact: leena.salmela@cs.helsinki.fi 
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1 INTRODUCTION 

Present high-throughput sequencing machines can produce hundreds 
of millions of short reads in a single run. Initially, this technology 
was targeted at resequencing applications but is nowadays also 
deployed in de novo sequencing projects. Apart from single reads, 
the sequencing machines can also produce mate pairs, i.e. pairs of 
reads whose approximate distance in the target genome is known. 
Although there are good methods to assemble reads produced by the 
older Sanger technology, new approaches are needed to deal with 
data from the second-generation sequencing machines, because the 
characteristics of the data like read length and sequencing errors are 
different (Pop, 2009). 

In a de novo sequencing project, the read data is typically 
first quality trimmed and filtered to ensure high-quality data for 
subsequent processing. A method to correct sequencing errors may 
also be used. The next step is to assemble the reads into contigs, 
which are gapless sequences of nucleotides. In the last step, the 
contigs are ordered into scaffolds based on mate pair reads (Pop, 
2009). 

Here we consider the scaffolding problem. The input to the 
scaffolding problem is the contigs produced by a contig assembler, 
the mappings of the mate pair reads to the contigs and the insert 
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sizes of the mate pair libraries. The objective is to find a linear 
ordering of the contigs, which maximizes the number of mate pairs 
whose pairwise distance equals the insert size. In practise, a linear 
ordering of all the contigs is not achieved, because the data may 
be incomplete and the organism may have several chromosomes. 
Instead each contig is assigned to a scaffold and given an orientation 
and position within the scaffold. 

Kececioglu and Myers (1995) have shown that even determining 
the orientation of the contigs is NP-hard. Therefore, all practical 
methods to solve the scaffolding problem use heuristics and achieve 
only an approximate solution. 

Many assemblers like Velvet (Zerbino and Birney, 2008), Allpaths 
(Butler et al, 2008) and SOAPdenovo (Li et al, 2010) contain a 
scaffolding module. Some stand-alone scaffolders have also been 
developed. Bambus (Pop et al, 2004) is designed for Sanger data, 
and SOPRA (Dayarian et al, 2010) and SSPACE (Boetzer et al, 
201 1) are developed for second-generation sequencing data. Bambus 
and SSPACE are based on a greedy method, whereas SOPRA 
relies on statistical optimization and partitioning the scaffolding 
problem. Our new approach uses a partitioning scheme similar to 
SOPRA but unlike SOPRA we restrict the size of the partitions 
and thus we can solve the subproblems exactly with mixed integer 
programming. 

The rest of the article is outlined as follows. In Section 2, we 
present a theoretical framework for scaffolding based on mixed 
integer programming and partitioning the scaffolding problem. 
Section 3 presents a practical implementation of this framework. We 
compare the new scaffolder to SOPRA and SSPACE in Section 4. 
Finally, we conclude in Section 5. 

2 ALGORITHM 

The input to a scaffolder is a set of contigs produced by an assembler 
and a set of mate pairs. First, we map the mate pairs to the contigs. 
This yields links between contigs that we represent as a graph. The graph 
is then partitioned into subproblems, which we solve with mixed integer 
programming. Finally, we combine the solutions of the subproblems to solve 
the whole scaffolding problem. Each of these tasks is discussed in more 
detail below. In Section 3, we address practical issues in implementing our 
approach. 

2.1 Scaffolding graph 

The input to the scaffolding problem can be represented as a bidirected 
graph called the scaffolding graph. Each contig is represented by a node 
in the graph. There is an edge between two contigs if there are mate pairs 
linking them. The edge is directed at both endpoints indicating the orientation 
of the contigs with regard to each other. Figure 1 shows the four possible 
ways of linking two contigs. Here the mate pair ends are oriented as in 
SOLiD data, where both reads of the pair are from the same strand of the 
genome. 
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Fig. 1. The four possible ways of linking two contigs with a mate pair and 
the bidirected edge representing each case. 



Fig. 2. The two possible placements of contigs i and j when an edge 
with orientation corresponding to the top left case in Figure 1 is used in 
a scaffolding. The top figure shows the case when both contigs are forward 
oriented and the bottom one shows the case when they are reverse oriented. 



Several mate pairs can induce the same edge in the scaffolding graph. The 
total number of these mate pairs is the support Sk of the edge k. These mate 
pairs are used to estimate the distance dk between the two contigs. 

The scaffolding problem can now be formulated as removing a set of edges 
with minimum combined support from the scaffolding graph such that each 
node can be assigned a position and orientation that satisfy the constraints 
imposed by the remaining edges. Each scaffold can then be constructed in 
two equivalent ways: the other way can be obtained by reversing the whole 
scaffold. 

2.2 Mixed integer programming formulation 

In this section, we will develop a mixed integer programming (MIP) 
formulation for the scaffolding problem. As a convention, we will use 
italic type to denote variables and roman type to denote constants for those 
quantities that are present in the MIP formulation. There are n contigs. Let us 
denote the length of contig i by a,-. Let L be an estimate of the upper bound 
of possible scaffold length (e.g. an estimated upper bound of the length of 
the entire genome or the longest chromosome). 

The solution to the MIP problem will assign values to the following 
variables. For each contig i (1 < i < n), we have two integer-valued variables: 

• xi e {1, ...,L}: the location of contig i in its scaffold. The location is 
always the position that aligns with the beginning of the contig. So 
if the contig aligns to the scaffold in reverse, the location is still the 
position where the beginning of the contig aligns. See Figure 2 for 
examples. 

• Oi e {0, 1}: the orientation of the contig, 0 meaning reverse orientation 
and 1 forward orientation. 

Additionally, there is one real-valued variable for each edge in the 
scaffolding graph: 

• 7*6 [0,1]: how well the distance constraint imposed by edge k is 
satisfied, h = 1 means that in the solution the distance between contigs 
of edge k is exactly d^. 

We have thus 2n-\-m variables in total where m is the number of edges in 
the scaffolding graph. 

Each edge k imposes a set of constraints. We give here the constraints for 
an edge where the two contigs, i and j, are oriented and positioned with regard 
to each other according to the top left case in Figure 1 . Figure 2 shows the 
two possible placements of the contigs in a scaffold so that the mate pairs 
are satisfied. We see that here both contigs are either forward oriented or 
reverse oriented. Thus, Oi = Oj and we get the following constraints: 

d-Oj-il-h) < 0 (1) 

Oi-Oj + a-h) > 0 (2) 

where the indicator variable Ik is used to relax the constraints because if 
the mate pairs are not satisfied in the solution, then the orientations may not 
match either. 



From the top case of Figure 2, we see that the following should hold when 
both contigs are forward oriented: 

Xi+2Li+dk=Xj 

We get the following relaxed constraints: 

x i +a i +d k -C(l-I k )-C(l-Oi) < xj (3) 

Xi+m+dk+CQ-IjJ+Ca-Oi) > xj (4) 

where C is a large enough constant to relax the constraints if either this edge 
is not included in the solution or the two contigs are reverse oriented. A 
good choice for C is 2L as x\ <L and xj<L. Similarly, we get the relaxed 
constraints for the case where both contigs are reverse oriented (see bottom 
of Fig. 2): 

J9+d*+a/-C(l-7jfc)-C0j < xi (5) 
J9+d*+af+C(l-7jk)+Coi > x t (6) 

The constraints for edges corresponding to the other three cases shown 
in Figure 1 are derived similarly. For each edge, we thus get six constraints 
corresponding to Equations (1-6). In total, we get 6m constraints where m 
is the number of edges in the scaffolding graph. 

When defining the constraints imposed by edge k, we allowed the distance 
dk to vary by C(l —Ik)- Our objective is to minimize the amount by which 
we need to stretch these constraints: 

min^s^C(l-7^) 

k 

where the sum is over all edges in the scaffolding graph. This is equivalent 
to the objective function: 

max^s;t7jt (7) 

k 

where again the sum is over all edges in the scaffolding graph. Note that if 
Ijc were integer- valued, this would correspond to maximizing the number of 
satisfied mate pairs. 

The solution to the MIP problem assigns values to the variables Xi, oi 
and Ik- We then remove from the scaffolding graph those edges that are 
stretched or contracted by >1000bp (i.e. h is not close to 1). Each connected 
component of the graph then forms a scaffold and the variables X{ and Oi give 
the position and orientation of each contig i within its scaffold. 

The above formulation allows two contigs to be positioned in the solution 
so that they overlap. As post-processing, we detect such overlapping contigs 
and remove one of them from the scaffold if the overlap is long and the 
overlapping sequences of the contigs are not similar. 

2.3 Partitioning the scaffolding problem 

Dayarian et al. (2010) noticed that the scaffolding problem can be partitioned 
into subproblems that can be solved independently. If the removal of a contig 
divides the scaffolding graph into two components, then the scaffolding 
can be solved independently for these two components both of which also 
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include the removed node. Since the two parts of the graph only share one 
contig, the independent solutions can be easily combined. In graph theoretical 
terms, this corresponds to dividing the scaffolding graph into its biconnected 
components, and the contigs whose removal disconnects the graph are the 
articulation nodes. 

If there are no errors in mate pairs or their mappings, all contigs longer 
than the insert length of the mate pair library are articulation nodes in the 
scaffolding graph, because no mate pair spans over them. Thus, long contigs 
should divide the other contigs into those that come before them and those 
that come after. In real data, there are chimeric mate pairs and mate pairs 
that map to several locations in the genome. Therefore, not all long contigs 
in real data are articulation nodes. 

When combining the scaffolds from independent solutions, overlaps 
between contigs have to be observed. If combining two solutions would 
cause two contigs with sequence similarity <90% to overlap, we do not 
combine the solutions. Instead we keep one solution as it is and remove the 
contig corresponding to the articulation node from the other solution and 
split that solution into two scaffolds, those contigs that are placed before the 
removed one and those that are placed after. 

3 IMPLEMENTATION 

3.1 General 

MIP Scaff older is implemented as a set of C++ programs. We use Lemon 
graph library (http://lemon.cs.elte.hu/) to implement the scaffolding graph 
and lp_solve (http://lpsolve.sourceforge.net/) to solve the MIP problems. 

As input MIP Scaffolder takes the contigs as FASTA files and the 
alignments of the mate pairs to the contigs as SAM files. MIP Scaffolder 
can use several mate pair libraries with different insert sizes simultaneously. 
The output of MIP Scaffolder is the scaffolds in a FASTA file. 

3.2 Filtering mate-pair mappings 

We first use a read mapper like the readaligner tool by Makinen et al (2010) 
to map the mate pairs to the contigs. For further processing, we use only 
those mate pairs whose both ends map to a unique position in the contig 
collection. 

To improve the quality of the mate pair mappings, we filter the mappings 
so that we only keep mappings that are consistent with other mate pairs 
mapping to approximately the same location. We call a mate pair mapping 
(w,p)- consistent if for both ends of the mate pair the following holds: if a 
mate pair end maps to a contig, then there exists a window of length w on 
this contig such that if in total r mate pair ends map into the same window 
and link the contig to the same direction, then at least p • r of all these mate 
pairs link the same pair of contigs as the original mate pair. We only keep the 
mate pairs mappings that are (w,p)-consistent for some w and p which are 
parameters of our method. Figure 3 shows examples of (w,0.5)-consistent 
and inconsistent mate pair mappings. Consistent mate pair mappings can 
be found by sliding a window of length w on each contig. If the window 
contains r mappings and p • r of these link the same pair of contigs, then the 
p-r mate pair ends are marked consistent. We then scan all the mate pair 
mappings and keep those whose both ends are marked consistent. 

Filtering out mate pairs that are not consistent has two benefits. First, 
it ensures that mate pairs mapping to repeat regions are not used in the 
scaffolding process even if the repeat region is present only once in the contig 
collection. If all repeat regions would be present several times, keeping only 
unique mappings should suffice. Second, the insert length distributions tend 
to have long tails. Mate pairs whose insert length deviates significantly from 
the mean may be filtered out allowing us to estimate the distance between 
contigs more accurately. 

3.3 The scaffold graph 

To estimate the distance d^ associated with an edge k linking two contigs, we 
use edge bundling (Pop et al , 2004) combined with a maximum likelihood 




w, 0.5 



Fig. 3. Examples of (w, 0.5)-consistent and inconsistent mate pair mappings. 
The arrows denote contigs and the short lines mate pairs. The mate pair ends 
are connected with arcs. The mate pairs with solid line arcs connecting them 
are consistent and the mate pairs with dashed line arcs connecting them are 
not consistent. 

method (Gnerre et al, 2011; Simpson et al, 2009). For each mate pair 
library, we estimate the mean insert size g mea n and the insert size range 
[gmin> gmax] and using these values we compute minimum and maximum 
suggested distances between the contigs based on each mate pair. We then 
find a maximum set of mate pairs such that the intersection of the suggested 
distance ranges is a non-empty range. The final distance d^ is then computed 
as a maximum likelihood estimate based on the mean insert size. 

Contigs from repeat regions complicate the scaffolding process, because 
they do not have a unique placement in the genome. Like previous approaches 
(Dayarian et al, 2010), we attempt to recognize such contigs based on 
their high degree in the scaffolding graph or much higher than expected 
coverage and remove them from the scaffolding graph. By default, contigs 
with coverage >2.5 times the average coverage of contigs or degree >50 are 
classified as repeat contigs. Both thresholds can be adjusted by the user. 

3.4 Restricting the size of partitions 

If the biconnected components of the scaffolding graph are large, it is not 
feasible to solve the corresponding MIP problem. Furthermore, solutions to 
MIP problems of such large partitions often place many contigs so that they 
overlap with other contigs even if the overlapping sequences are not similar. 
This complicates the post-processing of the solutions. 

Therefore, we restrict the size of the biconnected components by using 
a suitable subgraph of the scaffolding graph. We use the technique by 
Westbrook and Tarjan (1992) to keep track of the biconnected components 
dynamically as new edges are added to the scaffolding graph. We measure 
the size of a biconnected component by the number of edges. First, we sort 
the potential edges of the scaffolding graph in decreasing order according to 
their support s^. We then add the edges to the graph in this order but only if 
the addition of an edge does not create a too large biconnected component. 
This allows us to demand high support edges for those parts of the genome 
that are well sampled by the mate pair library, while also being able to utilize 
lower support edges for those parts of the genome that have a lower coverage 
in the mate pair library. We show in our experiments that a relatively small 
threshold for the component size yields the best scaffolding results. 

4 RESULTS AND DISCUSSION 

4.1 Mate pair libraries and contigs 

The read sets used to test our approach are summarized in Table 1 . 
The first two read sets are SOLiD reads from the Escherichia coli 
DH10B substrain and the Caernorhabditis Elegans genome. The 
E.coli read set is produced by Applied Biosystems, and we used the 
filtered subset available at http://hts.rutgers.edu/. This set is quality 
filtered and trimmed to 35 bp. We used the mean filter distributed 
with SOPRA to filter the C. elegans read set and kept reads with 
average quality at least 19 and no indeterminate colors. The third 
read set is an Illumina paired end library for Pseudomonas syringae 
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Table 1. The test datasets 



E.coli 



C.elegans 



Rsyringae 



H. sapiens 



Reference organism 
Name 

Genome size (Mbp) 



Escherichia coli 
4.6 



Caenorhabditis elegans 
100.3 



Pseudomonas syringae 
6.1 



Homo sapiens 
3080 



Dataset 

Accession number (SRA/ERA) 



Read length (bp) 
Number of reads 
Coverage 

Spanning coverage 
Insert size 

Sequencing platform 



NA 



35 

2x 7.4M 

HOx 

1930x 

1200 

SOLiD 



SRR033650 



25 

2x 24M 
lOx 
427 x 
1785 
SOLiD 



ERR005143 



36 

2x 3.6M 
42 x 
236x 
400 

Illumina 



SRR067771 
SRR067777 
SRR067781 
SRR067776 
36 

2x 298.6M 

7x 

223 x 

2300 

Illumina 



SRR067773 
SRR067779 
SRR067778 
SRR067786 
36 

2x 342.0M 

8x 

311x 

2800 

Illumina 



The statistics are given for the sets after trimming and filtering. The spanning coverage is the total coverage of the read pairs and the bases in between them. 



and the fourth read set consists of two Illumina mate pair libraries 
for the human genome. The human Illumina mate pair libraries were 
first trimmed to 36 bp and then we filtered out all reads where >30% 
of the bases had quality score < 10. The statistics in Table 1 are for 
the filtered sets. 

We ran the initial contig assembly with Velvet (Zerbino and 
Birney, 2008). The reads in the C.elegans dataset were too short and 
the coverage too low to assemble the reads into contig s. We were 
also not able to assemble contigs for the complete human genome 
because of memory constraints. Instead we used the wgsim tool 
distributed with SAMtools (Li et al, 2009) to generate synthetic 
reads of length 100 from the reference genome and built the contigs 
based on these reads. For the human genome, the reads were 
generated and assembled one chromosome at a time to keep the 
memory requirements on a reasonable level. The coverage of these 
synthetic reads was 30 x for C.elegans. For the human genome, we 
generated paired reads with insert size 500 and 20 x coverage. These 
reads were not made available for scaffolding. 

When running Velvet, we kept contigs > 150 bp. We used the 
following other parameters: 

• E.coli'. hash length 19 and coverage cutoff 6. 

• C.elegans: hash length 61 and coverage cutoff 6. 

• Rsyringae: hash length 21 and coverage cutoff 6. 

• H. sapiens: hash length 57, coverage cutoff 6, and expected 
coverage 20, no scaffolding. 

The E.coli dataset was in color space and we used the contig 
translation of SOPRA to get contigs in base space. For the other 
datasets, the reads used to build the contigs were in base space 
and so we could directly use the contigs produced by Velvet. The 
statistics of the produced contigs are included in Tables 3-6. 

4.2 Evaluation of scaffolds 

The produced scaffolds were compared for the length of the scaffolds 
and for their correctness. For length comparison, we used the N50 
measure as well as the total length of the scaffolds. 



We evaluated the correctness of the scaffolds as follows. Using 
swift by Rasmussen et al. (2006), we searched for maximal 
approximate local matches between the scaffold and the genome 
sequence. With E.coli and Rsyringae, we used parameters 11 for 
the seed length, 0.05 for the maximum error level and 30 for the 
minimum length of a maximal match. With C.elegans, we used 
the same seed length, 0.02 for the maximum error level and 35 
for the minimum length of a maximal match. With the human 
scaffolds, we used 0.01 for the maximum error level, and 100 
for the minimal length of a maximal match. This produced a set 
of tuples each representing a match between a substring range of 
scaffold and substring range of genome. To obtain an alignment 
for the whole scaffold inside the genome, we implemented the 
colinear chaining algorithm (Abouelhoda, 2007) that finds a subset 
of the tuples that retains the linear order in both source (scaffold) 
and target (genome), and maximizes the overall coverage of the 
source (scaffold). Additionally, we restricted the length of gaps in the 
colinear chaining algorithm to 2000 bp on the E.coli and Rsyringae 
datasets and to 5000 bp on the C.elegans and human datasets. This 
way we could obtain a reliable alignment without resorting into 
laborious sequence-level dynamic programming. 

Three measures were computed based on the colinear chaining. 
Genome coverage denotes the proportion of positions inside the 
genome taking part into the alignment. Scaffold coverage denotes 
the proportion of positions inside scaffolds taking part into the 
alignment. We also extracted those parts of the scaffolds that 
took part into the alignment and repeated the colinear chaining 
computation recursively for the remaining part until no parts having 
a maximal local approximate match remained. Finally, we computed 
the N50 length statistic for all extracted parts. We call this N50 
value the normalized N50 statistic. Note that the extracted parts of 
scaffolds do not contain gaps and so the normalized N50 statistic 
should be compared to the N50 value computed on the complete 
scaffolds excluding gaps. 

The validity of scaffolds was additionally measured similarly to 
Gnerre et al. (201 1). Pairs of sequences separated by a given distance 
were extracted randomly from the scaffolds. These pairs were then 



3262 



Fast scaffolding with mixed integer programs 



200000 



150000 



£ 100000 



O 

CD 

E 




O 
a> 
E 



200000 



150000 



15 100000 - 



50000 



0.93 
0.92 
0.91 
0.9 
0.89 
0.88 
0.87 



I I I I I I I 
















w=10, p=0.7 






w=20, p=0.6 


— — X— 


I I 


w=30, p=0.5 
i i i r i 





20 



40 



60 



80 



100 120 



Maximum component size 



w=10, p=0.7 
w=20, p=0.6 
w=30, p=0.5 



20 



40 60 80 100 
Maximum component size 



120 



140 



140 



Fig. 4. The effect of filtering (w,p)-consistent mappings for the E.coli 
dataset. Here we used the maximum component size of 100 edges. 



Fig. 5. The effect of maximum component size for (w,p)-consistent 
mappings for various combinations of w and p for the E.coli dataset. 



mapped to the reference and we measured the proportion of pairs 
whose orientation matched and distance was within 10% of the given 
distance. The validity of contigs was not measured in this fashion 
as they were not long enough. 

4.3 Parameters and runtime of the scaffolders 

For the E.coli, C.elegans and P.syringae datasets, we used 
readaligner (Makinen et al, 2010) to map the mate pairs to contigs 
and for the human dataset we used SOAP2 (Li et al, 2009). For 
the E.coli and P.syringae datasets, the results for SOPRA are for the 
version which is integrated with Velvet and thus directly uses the 
placement of reads as designated by Velvet. We allowed at most two 
edit operations (mismatches and indels) when running readaligner 
and at most three mismatches when running SOAP2. We were able 
to map 91 % of the E.coli reads, 87% of the C.elegans reads and 87% 
of the P.syringae reads. For the human dataset, SOAP2 reported a 
unique match for 70% of the reads. 

We tried several parameter combinations for MIP Scaffolder. 
Figure 4 shows how the N50 value and the genome coverage of 
the scaffolds for the E.coli dataset evolve as we vary w and p when 
determining (w,p)-consistent mappings. The longest scaffolds are 
produced with fairly loose filtering but the loosest filtering also gives 
less accurate scaffolds. Figure 5 shows how the N50 value and the 
genome coverage evolve when the maximum size of the biconnected 
component is varied. Here, we notice that the maximum size of 
the component does not affect the scaffold lengths or correctness 
much. For the analysis of the scaffolds, we used (10,0.7)-consistent 
mappings and maximum component size 100. 

We similarly experimented with different parameters on the 
other datasets and chose those parameters that produced longest 
scaffolds. For the C.elegans dataset, we used (30,0.75)-consistent 
read mapping and maximum component size 50, for the P.syringae 
dataset (30, 0.5)-consistent read mappings and maximum component 



Table 2. Runtimes (in hours) of the scaffolders 





E.coli 


C.elegans 


P.syringae 


H. sapiens 


SOPRA 


0.17 


3.12 


0.03 




SSPACE 






0.02 


9.5 


MIP Scaffolder 


0.03 


0.20 


0.004 


18.8 



size 50 and for the human dataset (50,0.75)-consistent read 
mappings and maximum component size 50. 

We tried several parameters for running SOPRA and show 
the results for those parameters that gave the longest scaffolds 
measured by the N50 value. The E.coli and the P.syringae scaffolds 
were produced with the version of SOPRA that is integrated with 
Velvet, and the C.elegans and human scaffolds were produced using 
pregenerated contigs. For the E.coli dataset, we got best results 
with support threshold (-w) 6. The original paper reports result with 
support threshold 5 but our scaffolds are slightly longer. This might 
be due to the stochastic nature of the algorithm which means that 
even different runs with the same parameters may produce different 
results. For the C.elegans and P.syringae datasets, we got longest 
scaffolds with support thresholds 20 and 4, respectively. We could 
not run SOPRA on the human dataset because it ran out of memory. 

SSPACE does not support mate pair libraries from the SOLiD 
technology and so we ran it only on the P.syringae and human 
datasets. SSPACE was run with default parameters. 

The scaffolders were run on a server with 16 cores operating at 
2.93 GHz and 128 GB of memory. Table 2 shows the runtimes of the 
different scaffolders. We see that MIP Scaffolder is the fastest of the 
scaffolders except on the human dataset. On the human dataset, MIP 
Scaffolder spent half of its time preprocessing the read mappings. 
As further work, we plan to improve the efficiency of this phase. 
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Table 3. The length and validation statistics of scaffolds produced from the E.coli data set by the scaffolders 



Scaffolder 



Number of 
scaffolds 



N50 (bp) 



With gaps No gaps 



Total length (bp) 



With gaps No gaps 



Genome Scaffold Normalized Validity 
coverage coverage N50 (bp) at lOkb 



Contigs 
SOPRA 
MIP Scaffolder 


4341 
168 
207 


1499 

203 090 185 290 
182666 170864 


4 362035 
4 840 884 4 354534 
4558 212 4 327789 


0.922 
0.912 
0.889 


0.998 
0.988 
0.968 


1496 
185 227 
170796 


0.910 
0.972 


When computing 


colinear chaining to obtain genome coverage, scaffold coverage and normalized N50 values, gaps were restricted to at most 2000 bp. 




Table 4. The length and validation statistics of scaffolds produced from the C.elegans dataset by the scaffolders 








Scaffolder 


Number of 
scaffolds 


N50 (bp) 


Total length (bp) 


Genome 
coverage 


Scaffold 
coverage 


Normalized 
N50 (bp) 


Validity 
at lOkb 




With gaps No gaps 


With gaps No gaps 


Contigs 
SOPRA 
MIP Scaffolder 


31419 
17951 
10721 


14717 

132547 130346 
189704 187796 


96249755 
98 595 934 96243 554 
98 273 945 95 406 856 


0.943 
0.945 
0.933 


1.000 
0.999 
0.986 


14717 
130346 
183 891 


0.990 
0.973 


When computing colinear chaining to obtain genome coverage, scaffold coverage and normalized N50 values, gaps were restricted to at most 5000 bp. 




Table 5. The length and validation statistics of scaffolds produced from the Rsyringae dataset by the scaffolders 








Scaffolder 


Number of 
scaffolds 


N50 (bp) 


Total length (bp) 


Genome 
coverage 


Scaffold 
coverage 


Normalized 
N50 (bp) 


Validity 
at 10 kb 




With gaps No gaps 


With gaps No gaps 


Contigs 
SOPRA 
SSPACE 
MIP Scaffolder 


2251 
568 
345 
188 


6972 

75 224 74724 
94 315 93 850 
103 598 103 352 


5 925 601 
5 987 540 5 917490 
5 991990 5 910 839 
5 990318 5 919596 


0.964 
0.958 
0.946 
0.918 


0.981 
0.989 
0.978 
0.949 


6982 
72714 
93 850 
84779 


0.998 
0.984 
0.983 


When computing 


colinear chaining to obtain genome coverage, scaffold coverage and normalized N50 values, ga 


ps were restricted to at most 2000 bp. 




Table 6. The length and validation statistics of scaffolds produced from the H. sapiens dataset by the scaffolders 








Scaffolder 


Number of 
scaffolds 


N50 (bp) 


Total length (bp) 


Genome 


Scaffold 
coverage 


Normalized 
N50 (bp) 


Validity 
at 50 kb 




With gaps No gaps 


With gaps No gaps 


coverage 


Contigs 
SSPACE 
MIP Scaffolder 


349 514 
97 525 
83 909 


18 994 

348 941 348 566 
328 665 325 444 


2785 310070 
2 834193 333 2783 020651 
2 823 814726 2776 643 640 


0.863 
0.640 
0.684 


0.959 
0.719 
0.769 


18185 
179418 
190008 


0.869 
0.870 



When computing colinear chaining to obtain genome coverage, scaffold coverage and normalized N50 values, gaps were restricted to at most 5000 bp. 



4.4 Comparison of scaffolds 

Tables 3-6 show the length and validation statistics of the scaffolds 
produced by SOPRA, SSPACE and MIP Scaffolder for the various 
datasets. We report the scaffold lengths both including gaps and 
without gaps. The length statistics without gaps are more important, 
because the statistics with gaps can vary if the insert length of the 
mate pair library is estimated incorrectly (Pop et al., 2004). 

For the E.coli data, SOPRA produces longest scaffolds that are 
also the most accurate as measured by genome coverage. The 
validity at 10 kb of these scaffolds is low because SOPRA' s estimate 
of the insert size is not accurate. For the C.elegans data, MIP 



Scaffolder produces longest scaffolds that are almost as accurate as 
those produced by SOPRA. On this dataset, we note that scaffolds 
produced by SOPRA have a higher genome coverage than the 
contigs which is likely due to our method maximizing the scaffold 
coverage instead of genome coverage; contigs can be easier aligned 
to the same region in genome than scaffolds. For the Rsyringae 
data, MIP Scaffolder produces longest scaffolds that are not quite 
as accurate as those produced by the other methods. We also note 
that SSPACE performs better than SOPRA on this dataset. For the 
human scaffolds, the coverage figures are quite low for both MIP 
Scaffolder and SSPACE. This is partly due to the strict alignment 
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criteria that was used to make running the validation feasible. Also 
the coverage of the contigs is low for this dataset indicating that 
a larger portion of the contigs are chimeric. A chimeric contig 
incorporated into a scaffold breaks the alignment of the scaffold 
against the reference into two parts of which only the longer one is 
considered in our validation method when computing genome and 
scaffold coverage. Structural variation between individuals may also 
cause lower coverage in the human data as the contigs are built on 
simulated reads from the reference genome, while the mate pairs 
are real data. Also validity at 50 kb is low indicating problems with 
scaffolding. 

5 CONCLUSION 

We have presented MIP Scaffolder which partitions the scaffolding 
problem into subproblems of restricted size and solves these 
subproblems exactly with mixed integer programming. We 
compared MIP Scaffolder to SOPRA and SSPACE on four datasets. 
On two of the datasets, MIP Scaffolder produced longer scaffolds 
that are not quite as accurate as those produced by the other two 
methods. For the human dataset, MIP Scaffolder produced slightly 
shorter but more accurate scaffolds than SSPACE and on the E.coli 
dataset SOPRA outperformed MIP Scaffolder. Our experiments also 
showed that our approach is fast allowing the user to try out different 
parameter combinations easily to optimize for long scaffolds. It is 
also possible to optimize for accurate scaffolds if, for example, 
EST data or the sequence of a close relative species can be used 
to measure the correctness of scaffolding. 
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